
Conversation

@ORippler ORippler commented Oct 29, 2025

This PR hides the latency of the bias and gate loads in the fused mul_mat_vec_q kernel by loading them into registers before the dot-product is computed, effectively batching those loads together with the dot-product. Because many threads are still alive in this part of the kernel, the warp scheduler has enough eligible warps to hide the cost of loading those two single floats.
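
Below is a minimal, self-contained sketch of the scheduling idea, not the actual mmvq kernel from ggml-cuda: the kernel name, signature, launch shape (one warp per output row) and the way bias/gate enter the final result are all illustrative assumptions. The point it demonstrates is issuing the two single-float loads before the dot-product loop so their latency overlaps with the arithmetic.

```cuda
// Sketch only: launched with one warp (32 threads) per block, one block per output row.
__global__ void fused_mmvq_sketch(const float * __restrict__ x,
                                  const float * __restrict__ y,
                                  const float * __restrict__ bias,
                                  const float * __restrict__ gate,
                                  float       * __restrict__ dst,
                                  const int     ncols) {
    const int row = blockIdx.x;
    const int tid = threadIdx.x;

    // Issue the two single-float loads before the dot-product loop: the values
    // are only consumed after the reduction, so their memory latency overlaps
    // with the arithmetic below instead of stalling the tail of the kernel.
    const float b = bias[row];
    const float g = gate[row];

    // Per-thread partial dot product over the row.
    float sum = 0.0f;
    for (int col = tid; col < ncols; col += blockDim.x) {
        sum += x[(size_t) row*ncols + col] * y[col];
    }

    // Warp-level reduction of the partials (valid because blockDim.x == 32).
    for (int offset = warpSize/2; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }

    if (tid == 0) {
        // How bias and gate enter the final value is simplified here; the point
        // is only that b and g are already in registers by the time they are needed.
        dst[row] = g * (sum + b);
    }
}
```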

This gives a 3-14% end-to-end speed-up for gpt-oss models (qwen3moe does not use bias and gate, and I am not aware of any other MoE model that uses them and that I could run end-to-end perf tests on). The kernels themselves are up to 20% faster for gpt-oss.

| GPU | Model | Test | t/s master | t/s this branch (babfd19) | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 4000 SFF Ada | gpt-oss 20B MXFP4 MoE | tg128 | 85.41 | 88.20 | 1.03 |
| RTX 4000 SFF Ada | gpt-oss 20B MXFP4 MoE | tg256 | 85.66 | 88.57 | 1.03 |
| RTX 4000 SFF Ada | gpt-oss 20B MXFP4 MoE | tg512 | 84.80 | 87.62 | 1.03 |
| RTX 6000 Ada | gpt-oss 20B MXFP4 MoE | tg128 | 240.55 | 248.09 | 1.03 |
| RTX 6000 Ada | gpt-oss 20B MXFP4 MoE | tg256 | 245.03 | 253.23 | 1.03 |
| RTX 6000 Ada | gpt-oss 20B MXFP4 MoE | tg512 | 242.32 | 250.41 | 1.03 |
| RTX PRO 4500 BW | gpt-oss 20B MXFP4 MoE | tg128 | 212.97 | 223.10 | 1.05 |
| RTX PRO 4500 BW | gpt-oss 20B MXFP4 MoE | tg256 | 220.06 | 236.34 | 1.07 |
| RTX PRO 4500 BW | gpt-oss 20B MXFP4 MoE | tg512 | 218.21 | 237.13 | 1.09 |
| RTX PRO 6000 BW Max-Q | gpt-oss 120B MXFP4 MoE | tg128 | 198.64 | 208.72 | 1.05 |
| RTX PRO 6000 BW Max-Q | gpt-oss 120B MXFP4 MoE | tg256 | 222.68 | 238.11 | 1.07 |
| RTX PRO 6000 BW Max-Q | gpt-oss 120B MXFP4 MoE | tg512 | 224.01 | 240.44 | 1.07 |
| RTX PRO 6000 BW Max-Q | gpt-oss 20B MXFP4 MoE | tg128 | 296.58 | 338.69 | 1.14 |
| RTX PRO 6000 BW Max-Q | gpt-oss 20B MXFP4 MoE | tg256 | 314.90 | 356.62 | 1.13 |
| RTX PRO 6000 BW Max-Q | gpt-oss 20B MXFP4 MoE | tg512 | 329.31 | 353.40 | 1.07 |

This is realised by loading them into registers before computation of
the dot-product, effectively batching them together with said
dot-product. As a lot of threads are alive here, the warp scheduler has
enough threads available to effectively hide the cost of additionally
loading those two floats.
@TinyServal

Fixes #16815; benchmarks for the affected devices (Ampere cards with low memory bandwidth) can be found in the comments.

Results on the RTX A4000:

| Model | Test | t/s master (b9ce940) | t/s #16847 (babfd19) | Speedup |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | tg128 | 113.70 | 118.39 | 1.04 |
| gpt-oss 20B MXFP4 MoE | tg256 | 112.42 | 116.77 | 1.04 |
| gpt-oss 20B MXFP4 MoE | tg512 | 109.96 | 114.29 | 1.04 |

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 29, 2025
@am17an am17an merged commit 8b11dee into ggml-org:master Oct 30, 2025
71 of 72 checks passed
ORippler added a commit to ORippler/llama.cpp that referenced this pull request Oct 30, 2025
Pointed out [here](ggml-org#16847 (comment)) that only a single value is needed per target col per thread
am17an pushed a commit that referenced this pull request Nov 1, 2025
* CUDA: Remove unneded bias/gate dims in fused mmvq

Pointed out [here](#16847 (comment)) that only a single value is needed per target col per thread

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Fix "Error 991-D: extra braces are nonstandard" during compilation

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Nov 2, 2025
…is resolved

revert ggml-org#16715 (+2 squashed commit)

Squashed commit:

[289af2ee2] Revert "Hide latency of bias and gate-loading (ggml-org#16847)"

This reverts commit 8b11dee.

[a3e5c1e95] Revert "CUDA: add unused vars to mmvf and mmvq (ggml-org#16807)"

This reverts commit 463bbf2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
